(54) Method and apparatus for providing scaleable levels of application availability



(57) A method and an apparatus for providing scal- 
able layers of highly available applications using loosely 
coupled commercially available computers. The soft- 
ware running on the loosely coupled computers is divid- 
ed into three layers: the system layer, the platform layer, 
and the application layer, each having its own process 
group activation and fault recovery strategy. A process 
group contains software processes that depend upon a 
set of resources common to the process group. In addi- 
tion to depending upon a common set of resources, 
processes within a process group share a fault recovery 
strategy. Fault recovery is performed at the process 
group level, such that if one process within a process 
group fails, fault recovery takes place for all processes
within the process group. In the preferred embodiment,
an application layer process group may be paired with
another application layer process group on a separate
computer. As part of certain escalated process group
fault recovery strategies, upon taking an application lay-
er process group out of service, its paired application
layer process group, if any exists, takes over performing
the functions of the process group that was taken out of
service.

Description

Field of the Invention

[0001] The present invention relates to computer sys- 
tem architectures, and more particularly to a reliable 
cluster computing architecture that provides scalable 
levels of high availability applications simultaneously 
across commercially available computing elements. 

Statement of Related Art 

[0002] Prior art high availability clustered computer 
systems are typically configured in an architecture hav- 
ing shared physical storage devices, such as a shared 
disk. Therefore, prior art cluster offerings are typically 
based on physical hardware, or clustered arrangements 
of systems and storage, particularly adapted to a unique 
application processing environment. In a common type 
of prior art high availability cluster, all of the critical ap- 
plication data must reside on an external shared disk, 
or on a pool of disks, that is accessible from at most one 
computing system in the cluster. Such a prior art cluster 
tries to isolate access to the data partitions on the disk 
so that access to the shared disk is limited to only one 
computing system at a time. Upon failure of the primary 
computing system, a takeover occurs whereby the high 
availability cluster reallocates access to the disk from 
the primary computing system to the dedicated backup 
system. Once such a reallocation is performed, the ap- 
plications on that backup system will have access to the 
disk. 

[0003] Another prior art high availability cluster solu-
tion is a multi-processor cluster. Like the shared-disk
cluster, the multi-processor cluster is a hardware-based
cluster arrangement of computing systems. Unlike the
shared-disk cluster, in which the computing systems are
essentially unrelated to each other, the computing sys-
tems in a multi-processor cluster are all running the
same application and using the same data at virtually 
the same time. All physical storage is configured to be 
accessible to all computing systems. Such multi-proc- 
essor clusters, in an attempt to control access to con- 
current data, typically use lock management software to 
manage access to data and prevent any data corruption 
or integrity problems. The loss of a computing system 
from a multi-processor cluster allows the remaining sys- 
tems to continue processing the data. 
[0004] Another prior art high availability cluster solu- 
tion is a symmetrical multi-processing, or scalable par- 
allel processing, cluster based on a shared memory or 
system bus architecture where the memory is common 
to multiple computing systems. Such systems, in an at- 
tempt to improve performance by scaling the number of 
computing systems in the symmetrical multi-processing
cluster, allow a single computing system failure to cause
the entire symmetrical multi-processing or scalable par-
allel processing cluster platform to become unavailable.

[0005] Yet another high availability cluster architec- 
ture is a multiple parallel processor cluster, in which 
each computing system has its own memory and disk, 
none of which are shared with any other computing sys-
tem in the cluster. If one system has data on a disk, and
that data is required by another computing system, the
first computer sends the data over a high speed network
to the other computing system. Such multiple parallel
processor clusters, in an attempt to improve perform-
ance by allowing multiple computing systems to work
concurrently, allow data associated with a failed com-
puting system to become unavailable.
[0006] The prior art high availability clusters, in trying 
to provide different levels of availability, have used op-
erating system-based clusters to optimize the unique
data and application characteristics for a specific target-
ed commercial market. Such a targeted approach does
not lend itself well to certain industries, including tele-
communications, in which numerous legacy applica-
tions currently exist, each with unique recovery and per-
formance characteristics running on proprietary hard-
ware, some of which is fault tolerant.
[0007] Therefore, a computing system architecture 
that provides varying levels of high availability applica-
tions simultaneously across one or more loosely cou-
pled commercially available computing elements using
a commercially available interconnect is desirable. 
[0008] The prior art high availability cluster solutions
have the capability to support "heartbeats" and recovery
of a specified application. The most significant architec-
tural difference between the prior art solutions is the
method for determining how an application and/or com-
puting system is chosen or controlled to be active or
standby and the method for determining when they will
be allowed access to the application data. Typical phys-
ical high availability cluster solutions determine the sta-
tus of the configuration via a set of redundant commu-
nication facilities between the pair of computing sys-
tems. Under most circumstances, the paired systems
are able to determine which system is active for an ap-
plication.

[0009] In prior art high availability solutions, when all
communication is lost between computing systems, the
computing systems or clustered applications might each
take on an active role believing that the other has failed.
Such a situation presents an undesirably high risk of ap-
plication data and processing being corrupted. Several
added levels of protection and safety are possible to pre-
vent that from happening. Some solutions in the prior
art nearly eliminate this risk using heartbeats through
the shared storage. Since certain cluster solutions do
not need to use shared storage, a platform neutral hard-
ware component is desirable to complement the soft-
ware-based cluster components. It is therefore an object
of this invention to provide scaleable layers of highly
available application processes using loosely coupled
commercially available computing elements.

Summary Of The Invention

[0010] This invention provides a method and an ap- 
paratus for providing scaleable layers of high availability 
applications using loosely coupled commercially avail- 
able computing elements, also referred to as comput- 
ers. Computing elements refers to any type of processor 
or any device containing such a processor. 
[0011] Resource dependencies and fault recovery 
strategies occur at the process group level. For exam- 
ple, a process group containing three processes might 
depend upon four resources, such as other process 
groups or peripheral devices, such as a disk. Upon fail- 
ure of a single process within the process group or upon 
failure of a single resource depended upon by the proc- 
ess group, fault recovery will be initiated for the entire 
process group, as a single unit. 
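
By way of non-limiting illustration, the group-level recovery rule described above may be sketched in Python as follows. The names ProcessGroup, is_affected_by and groups_to_recover, and the example process and resource names, are hypothetical and do not appear in this specification.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessGroup:
    """A set of processes plus the resources the whole group depends upon."""
    name: str
    processes: set = field(default_factory=set)
    resources: set = field(default_factory=set)

    def is_affected_by(self, failed_item: str) -> bool:
        # A fault in any member process, or in any depended-upon resource,
        # triggers fault recovery for the group as a single unit.
        return failed_item in self.processes or failed_item in self.resources


def groups_to_recover(groups, failed_item):
    """Return the names of every group whose recovery strategy must run."""
    return [g.name for g in groups if g.is_affected_by(failed_item)]


# Hypothetical example: two groups that share one disk resource.
billing = ProcessGroup("billing", {"p1", "p2", "p3"}, {"disk0", "link0"})
routing = ProcessGroup("routing", {"p4"}, {"disk0"})
print(groups_to_recover([billing, routing], "p2"))     # a process of billing failed
print(groups_to_recover([billing, routing], "disk0"))  # the shared resource failed
```

In this sketch a single failed resource can select more than one group, since each group is evaluated independently against the failed item.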

[0012] Process groups can belong to one of three lay-
ers: the system layer, the platform layer, or the applica-
tion layer. In the preferred embodiment, each layer has
a unique process group activation and fault recovery 
strategy. In the preferred embodiment, an application 
layer process group may be paired with another appli- 
cation layer process group on a separate computer. As 
part of certain escalated process group fault recovery 
strategies, upon taking an application layer process 
group out of service, its paired application layer process 
group, if any exists, takes over performing the functions 
of the process group that was taken out of service. 
[0013] Application layer process groups depend upon
one or more platform layer process groups, which de- 
pend upon one or more system layer process groups, 
which depend upon the hardware of the loosely coupled 
computer hosting the process groups. 
[0014] Upon a system layer process group failure, all 
process groups on the host computer are taken out of 
service, which includes activating on another computer 
or computers any application layer process group that 
is paired with an application layer process group taken 
out of service; the computer hosting the failed system
layer process group is re-booted, and all system layer, 
platform layer, and application layer process groups are 
re-initialized. 

[0015] Upon a platform layer process group failure, 
the platform layer process group may be re-started zero 
or more times. If re-starting the failed platform layer 
process group does not cure the platform layer process 
group failure or if the platform layer process group is not 
restartable, all application layer and platform layer proc- 
ess groups on the host computer are taken out of service 
and re-initialized, which includes activating on another 
computer or computers any application layer process 
group that is paired with an application layer process 
group taken out of service on the host computer. 
[0016] Upon failure of a resource depended upon by 
a platform layer process group, all application layer and 
platform layer process groups on the host computer are 
taken out of service and re-initialized, which includes ac-
tivating on another computer or computers any applica-
tion layer process group that is paired with an applica-
tion layer process group taken out of service on the host
computer. 

[0017] Upon failure of an application layer process
group, the failed application layer process group may
be restarted zero or more times. If restarting the failed
application layer process group does not correct the ap-
plication layer process group failure or if the failed ap-
plication layer process group is not restartable, then the
failed application layer process group is taken out of
service, which includes activating on another computer
the application layer process group, if any, that is paired
with the application layer process group taken out of
service.

[0018] Upon failure of a resource depended upon by 
an application layer process group, the dependent ap- 
plication layer process group is taken out of service, 
which includes activating on another computer the ap-
plication layer process group, if any, that is paired with
the dependent application layer process group taken out 
of service on the host computer. 

Brief Description Of The Drawings 

[0019] FIG. 1 is a block diagram illustration of the sub- 
ject invention, including four loosely coupled computers, 
an independent computer and a maintenance terminal. 
[0020] FIG. 2 is a state diagram illustrating the se-
quence of states that a process group in the subject in-
vention can transition through.

Detailed Description Of The Preferred Embodiment 

[0021] FIG. 1 shows the preferred embodiment of the
subject invention, including a maintenance terminal
(MT) 2, an independent computer (IC) 4, and four industry
standard commercially available computing elements 6
(also referred to as computers 6) loosely coupled to-
gether through an interconnect 8, such as a network, a
computer bus architecture, and the like. Computing el-
ements 6 could be any type of processor or any device
containing such a processor.

[0022] Each computer 6 is running process group
management software 10, also referred to collectively
as the process group manager. The process group man-
ager 10 activates process groups and initiates fault re-
covery strategies at the process group level. A process
group is a group of processes, which are typically im-
plemented in software, that are related to each other in
some way such that it is desirable to manage the proc-
ess group as a single unit. It may be desirable to restart
all of the processes within a process group together, or
it may be desirable, as part of an escalated fault recov-
ery strategy, to have the functionality of all of the proc-
esses in the process group performed by a process
group on a separate computer. The process group
might, but does not have to, depend upon a resource,
or a set of resources common to the process group. 
[0023] The independent computer 4 is preferably a 
computing device designed to have a minimal number 
of faults over an extended period of time. The independ- 
ent computer 4 monitors computers 6 for hardware 
faults using heartbeats, as disclosed in commonly as- 
signed U.S. Patent No. 5,560,033. Each of the loosely 
coupled computers 6 is coupled to the independent com-
puter 4 as shown with reference numeral 12 in FIG. 1.
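
By way of non-limiting illustration, a heartbeat check of the general kind performed by the independent computer 4 may be sketched in Python as follows. The heartbeat mechanism of U.S. Patent No. 5,560,033 is not reproduced here; the timeout-based rule, the class name HeartbeatMonitor, and the computer identifiers are assumptions made only for this sketch.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time reported for each monitored computer."""

    def __init__(self, timeout_s: float):
        # Assumed rule: a computer is suspect once it has been silent
        # for longer than timeout_s seconds.
        self.timeout_s = timeout_s
        self.last_seen = {}

    def record_heartbeat(self, computer_id: str, now: float) -> None:
        self.last_seen[computer_id] = now

    def suspected_faulty(self, now: float) -> list:
        """Return, in sorted order, the computers whose heartbeat is overdue."""
        return sorted(
            computer_id
            for computer_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.record_heartbeat("computer-A", now=0.0)
monitor.record_heartbeat("computer-B", now=0.0)
monitor.record_heartbeat("computer-A", now=2.0)  # only A keeps reporting
print(monitor.suspected_faulty(now=3.0))
```

Times are passed in explicitly so that the sketch does not depend on a real clock.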
[0024] Process groups may belong to one of three lay- 
ers: the system layer 13, the platform layer 15, and the 
application layer 17. Each computer 6 is shown running 
process group management (PGM) software 10, one 
system layer process group (SLPG) 14, one platform 
layer process group (PLPG) 16, either one or two appli-
cation layer process groups (ALPG) 18 that are not part of a
process-group pair (PGP) 24, one primary application
layer process group (P-ALPG) 20 that is part of a proc- 
ess-group pair 24, and one alternate application layer 
process group (A-ALPG) 22 that is part of a process- 
group pair 24. By definition: (1) each process-group pair 
24 contains one primary process group 20 on a first 
computer 6 and one alternate process group 22 on a 
second computer 6; (2) each primary process group 20, 
and each alternate process group 22, is part of a proc- 
ess-group pair; and (3) process groups 14, 16, and 18 
are not part of a process-group pair 24. In the preferred 
embodiment, process-group pairs may belong only to 
the application layer 17; however, process-group pairs 
could be provided at the system layer 13, the platform 
layer 15, or the system layer 13, or at all of these layers,
without departing from the scope of this invention. 
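
By way of non-limiting illustration, the definitional constraints on a process-group pair 24 may be checked as sketched below in Python. The names PlacedGroup and validate_pair, the role strings, and the computer identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedGroup:
    name: str
    role: str      # "primary", "alternate", or "unpaired"
    computer: str  # identifier of the hosting computer 6


def validate_pair(primary: PlacedGroup, alternate: PlacedGroup):
    """Check that a pair has one primary and one alternate on separate computers."""
    if primary.role != "primary" or alternate.role != "alternate":
        raise ValueError("a pair needs one primary and one alternate process group")
    if primary.computer == alternate.computer:
        raise ValueError("primary and alternate must be hosted on separate computers")
    return (primary.name, alternate.name)


pair = validate_pair(
    PlacedGroup("P-ALPG 20", "primary", "computer-1"),
    PlacedGroup("A-ALPG 22", "alternate", "computer-2"),
)
print(pair)
```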
[0025] Although each computer 6 in the preferred em-
bodiment is running the number of certain types of proc-
ess groups mentioned above, each computer 6 could,
without departing from the scope of this invention, be
running: (1) zero or more process groups 14, 16, and
18; (2) zero or more primary process groups 20; and (3)
zero or more alternate process groups 22. 
[0026] Similarly, the number of computers 6 can be 
two or more without departing from the scope of this in- 
vention, and, although the process group management 
software 10 is shown running on all four of the loosely 
coupled computers 6, it could be running on any permu- 
tation or combination of the loosely coupled computers 
6 and/or the independent computer 4 without departing 
from the scope of this invention. Further, the functions 
performed by the independent computer 4 could be per- 
formed using any type of device capable of running any 
type of processor such as a peripheral board with a dig- 
ital signal processor, a system board with a general pur- 
pose processor, a fault tolerant processor and the like, 
without departing from the scope of this invention. 
[0027] The independent computer 4 uses industry- 
standard interfaces 12, such as RS-232 in the preferred 
embodiment, although any interface could be used. This 
provides the capability of leveraging general networking
interfaces, such as Ethernet, Fiber Distributed Data In-
terface (FDDI), Asynchronous Transfer Mode (ATM),
and protocols, such as Transmission Control Protocol/
Internet Protocol (TCP/IP), or Peripheral Component In-
terconnect (PCI).

[0028] Each loosely coupled computer 6 typically con-
tains a uni-processor, multi-processor, or fault tolerant
processor system having an operating environment that
is the same as, or different from, the operating environ-
ment of each of the other loosely coupled computers 6.
In other words, separate computers 6 can run different
operating systems, for instance, WINDOWS as op-
posed to UNIX; different operating environments, for in-
stance, real-time as opposed to non-real-time; and have
different numbers and types of processors. Each of the
loosely coupled computers 6 can be either located at
the same site or geographically separated and connect-
ed via a network 8, such as a local area network (LAN)
or a wide area network (WAN).

[0029] As previously mentioned, each process group
14, 16, 18, 20, or 22 contains one or more processes
that may depend on a set of resources common to the
process group 14, 16, 18, 20, or 22. For example, a set
of such resources could include a computer hardware
peripheral device, another process group 14, 16,
18, 20, or 22, a communication link, available disk
space, or anything that might affect the availability of an
external application. Each alternate process group 22
depends upon a set of resources that is functionally
equivalent to the set of resources depended upon by the
alternate process group's paired primary process
group 20. The set of resources depended upon by a pri-
mary process group 20 and the separate set of resourc-
es depended upon by its paired alternate process group
22 do not, however, have to contain the same number
of resources. Each of the processes contained within a
process group 14, 16, 18, 20, or 22 also has an activa-
tion and fault recovery strategy common to the process
group 14, 16, 18, 20, or 22.

[0030] FIG. 2 shows the states through which every
process group 14, 16, 18, 20, and 22 may transition,
namely, Unavailable (Unavail) 30, Initialization (Init) 32,
Standby 34, Active 36, and Off-line 38. Process groups
in the Unavailable 30 and Initialization 32 states have
not been started and, therefore, are not running. Proc-
ess groups in the Active state 36 have been started and
are running. Whether a process group in the Standby
state 34 has been started and is running depends upon
whether the process group is a hot-standby or a cold-
standby process group. Hot-standby process groups in
the Standby 34 state have been started and are waiting
to be activated. Cold-standby process groups in the
Standby 34 state are not started until they are activated.
Activation of cold-standby process groups may also in-
volve initializing any uninitialized resources depended
upon by the cold-standby process group. In a non-fault
condition, primary process groups 20 are initialized to
run in the Active state 36, and alternate process groups
22 are initialized to the Standby state 34.
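
By way of non-limiting illustration, the five states of FIG. 2 and the question of whether a process group in a given state is running may be sketched in Python as follows. Using the reference numerals 30 to 38 as enumeration values is a convenience of this sketch only, and the names State and is_running are hypothetical.

```python
from enum import Enum


class State(Enum):
    # Values reuse the reference numerals of FIG. 2 for readability only.
    UNAVAILABLE = 30
    INITIALIZATION = 32
    STANDBY = 34
    ACTIVE = 36
    OFF_LINE = 38


def is_running(state: State, hot_standby: bool = False) -> bool:
    """Report whether a process group in the given state has been started."""
    if state in (State.UNAVAILABLE, State.INITIALIZATION):
        return False          # not yet started
    if state is State.ACTIVE:
        return True           # started and running
    if state is State.STANDBY:
        return hot_standby    # hot-standby is started; cold-standby is not
    # The description does not state whether Off-line groups are running,
    # so this sketch declines to answer rather than guess.
    raise ValueError("running status is not specified for the Off-line state")


print(is_running(State.STANDBY, hot_standby=True))
print(is_running(State.STANDBY, hot_standby=False))
```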
[0031] In the preferred embodiment, the Off-line state
38 can only be entered and exited by manual operation.
In other words, a human operator must enter a com-
mand from the local maintenance terminal 2 to put one
or more process groups 14, 16, 18, 20, and 22 into the
Off-line state 38 or to remove one or more process
groups 14, 16, 18, 20, and 22 from the Off-line state 38.
The Off-line state 38 may be entered under circum-
stances other than by manual operation, such as upon
a command from the process group manager 10, with-
out departing from the scope of this invention.
[0032] Assuming there are no resource or process 
group faults and neither the primary nor the alternate 
process groups 20 and 22 have been manually transi-
tioned to the Off-line state 38, process-group pairs 24
contain a primary process group 20 and an alternate
process group 22 in one of two Active 36/Standby 34 paired re-
lationships: active/cold-standby or active/hot-standby.
In the preferred embodiment, the primary process group 
20 is initialized to the Active state 36 and the alternate 
process group 22 is initialized to the Standby state 34. 
[0033] Although primary process groups 20 are initial- 
ized to Active 36 and alternate process groups 22 are 
initialized to Standby 34, under certain conditions, an 
alternate process group 22 can be Active 36 while its 
paired primary process group 20 is Standby 34. For ex- 
ample, if a fault occurs in a primary process group 20 
contained in a process-group pair 24 having an Active 
36/Standby 34 paired relationship, the process group 
manager 10 will transition that primary process group 
20 from Active 36 to Unavailable 30 and transition the 
alternate process group 22 from Standby 34 to Active 
36. Once the fault that caused the primary process 
group 20 to be transitioned to Unavailable 30 has been 
corrected, the primary process group 20 will be transi-
tioned to Standby 34 until some event, such as an alter-
nate process group fault, occurs to cause the alternate
process group 22 to transition to Unavailable 30, at
which time the primary process group 20 will be transi-
tioned from Standby 34 to Active 36. A process-group
pair's primary and alternate process groups 20 and 22
can be switched from Active 36/Standby 34 to Standby
34/Active 36 manually or under circumstances in which
switching the Active 36 process group 20 or 22 to Stand-
by 34 and the Standby 34 process group 20 or 22 to
Active 36 is desirable.
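
By way of non-limiting illustration, the switch-over sequence described above may be sketched in Python as follows. The function names on_fault and on_repair and the dictionary representation of a pair are hypothetical. The handling of a fault in a group that is not Active is an assumption of this sketch, since the description addresses only a fault in the Active group.

```python
def on_fault(pair: dict, role: str) -> dict:
    """Return the pair's states after a fault in the group with the given role.

    pair maps "primary" and "alternate" to a state name.
    """
    other = "alternate" if role == "primary" else "primary"
    new = dict(pair)
    was_active = pair[role] == "Active"
    # Assumption: any faulted group becomes Unavailable, Active or not.
    new[role] = "Unavailable"
    if was_active and pair[other] == "Standby":
        new[other] = "Active"  # the paired Standby group takes over
    return new


def on_repair(pair: dict, role: str) -> dict:
    """A repaired group returns as Standby; it does not reclaim the Active role."""
    new = dict(pair)
    if pair[role] == "Unavailable":
        new[role] = "Standby"
    return new


pair = {"primary": "Active", "alternate": "Standby"}
after_primary_fault = on_fault(pair, "primary")
after_repair = on_repair(after_primary_fault, "primary")
after_alternate_fault = on_fault(after_repair, "alternate")
print(after_primary_fault, after_repair, after_alternate_fault)
```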

[0034] The availability of an application layer process
group 18, 20 or 22 typically depends upon the availabil-
ity of one or more platform layer process groups 16. The
availability of a platform layer process group 16 typi-
cally depends upon the availability of one or more sys-
tem layer process groups 14, and the availability of sys-
tem layer process groups 14 depends upon the availa-
bility of the hardware of the computer 6 hosting the sys-
tem layer process groups 14. System layer process
groups 14 are initialized before platform layer process
groups 16, which are initialized before application layer
process groups 18, 20, and 22.
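
By way of non-limiting illustration, the initialization order implied by this dependency chain may be derived with a topological sort, sketched below using the Python standard library module graphlib (Python 3.9 or later). The dictionary contents and the function name initialization_order are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key depends upon the items in its value set.
depends_on = {
    "ALPG 18": {"PLPG 16"},
    "PLPG 16": {"SLPG 14"},
    "SLPG 14": {"hardware"},
}


def initialization_order(dependencies: dict) -> list:
    """Return an order in which every item follows the items it depends upon."""
    return list(TopologicalSorter(dependencies).static_order())


order = initialization_order(depends_on)
print(order)
```

For a simple chain such as this one only a single valid order exists; with several groups per layer, any order returned would still place each group after the groups it depends upon.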

[0035] This invention provides the flexibility to imple-
ment external applications using various numbers of
process groups 14, 16, 18, 20, and 22, and/or process-
group pairs 24 spread across two or more computers 6.
For instance, the four process-group pairs 24 shown in
FIG. 1 could be part of one external application, or they
could be part of four, or three, or two, separate external
applications. In addition, two of the process-group pairs
24 shown in FIG. 1 could be part of the same external
application, yet be hosted by two separate pairs of com-
puters 6. Further, one or more process-group pairs 24
and/or one or more process groups 14, 16, 18, 20, or
22 may depend upon the same resource as one or more
other process-group pairs 24 and/or one or more proc-
ess groups 14, 16, 18, 20, or 22, such that the failure of
a single resource could cause a plurality of process
groups' and/or a plurality of process-group pairs' fault
recovery strategies to be performed.
[0036] Each layer, system 13, platform 15, and appli-
cation 17, has a unique process group activation and
fault recovery strategy. In the preferred embodiment, the
system layer contains non-restartable non-relocatable
process groups 14. The platform layer may contain ei-
ther, or both, of two types of process groups. Platform
layer process groups 16 may be either: (1) non-restart-
able and non-relocatable; or (2) restartable and non-re-
locatable. The application layer 17 may contain any, or
all, of the following three types of process groups: (1)
non-restartable and non-relocatable process groups 18;
(2) restartable and non-relocatable process groups 18;
and (3) restartable and relocatable process groups 20
and 22. Relocatable refers to relocating performance of
the functionality of a primary process group 20 or an al-
ternate process group 22 from one computer 6 to an-
other computer 6, rather than any type of relocation
within the same computer 6. Primary process groups 20
and alternate process groups 22 are relocatable. Proc-
ess groups 14, 16, and 18 are not relocatable. As pre-
viously mentioned, in the preferred embodiment, a pri-
mary process group 20 and an alternate process group
22 can belong to only the application layer 17. Never-
theless, it will be obvious to those having ordinary skill
in the art that primary and alternate process groups 20
and 22 could belong to the platform and/or system lay-
ers and that other permutations and combinations of lay-
er-based fault recovery and process group activation
strategies and restartability and relocatability can be im-
plemented without departing from the scope of this in-
vention.
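
By way of non-limiting illustration, the combinations of restartability and relocatability permitted in each layer of the preferred embodiment may be tabulated in Python as follows. The names ALLOWED_TYPES and is_allowed are hypothetical.

```python
# (restartable, relocatable) combinations per layer, preferred embodiment only.
ALLOWED_TYPES = {
    "system": {(False, False)},
    "platform": {(False, False), (True, False)},
    "application": {(False, False), (True, False), (True, True)},
}


def is_allowed(layer: str, restartable: bool, relocatable: bool) -> bool:
    """Report whether a process group of this type may belong to the layer."""
    return (restartable, relocatable) in ALLOWED_TYPES[layer]


print(is_allowed("application", restartable=True, relocatable=True))
print(is_allowed("platform", restartable=True, relocatable=True))
```

As the description notes, other embodiments could permit further combinations; the table reflects only the preferred embodiment.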

[0037] Upon a system layer process group failure: (1)
all system layer, platform layer, and application layer
process groups 14, 16, 18, 20, and 22 on the computer
6 hosting the failed system layer process group ("host
computer") are taken out of service by transitioning
them to the Unavailable state 30; (2) for each primary
and alternate process group 20 and 22 that is transi-
tioned from Active 36 to Unavailable 30 on the host com-
puter, its paired process group 20 or 22 is activated on
a separate computer 6 by transitioning the paired proc- 
ess group 20 or 22 from Standby 34 to Active 36; and 
(3) the computer 6 that is hosting the failed system layer 
process group 14 is re-booted. In the preferred embod- 
iment, re-booting can be done a pre-determined number 
of times over a pre-determined time period. If re-booting 
host computer 6 does not clear the fault, the independ- 
ent computer 4 will then power cycle the host computer 
6, thereby re-booting the host computer 6. In the pre- 
ferred embodiment, power cycling can be done a pre- 
determined number of times over a pre-determined time 
period. If power cycling the host computer 6 does not
clear the system layer process group failure, the in-
dependent computer 4 will cut off power to the host com-
puter 6, which will remain in a powered down state.
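
By way of non-limiting illustration, the escalation from re-booting, to power cycling, to cutting off power may be sketched in Python as follows. The function name recover_host and the callables passed to it are hypothetical, and the pre-determined time period is omitted from this sketch for brevity; only the pre-determined number of attempts is modelled.

```python
def recover_host(reboot, power_cycle, max_reboots: int, max_power_cycles: int) -> str:
    """Try re-boots, then power cycles; power the host down if neither clears the fault.

    reboot and power_cycle are callables returning True when the fault is cleared.
    """
    for _ in range(max_reboots):
        if reboot():
            return "cleared by re-boot"
    for _ in range(max_power_cycles):
        if power_cycle():
            return "cleared by power cycle"
    return "powered down"


# Hypothetical scenario: re-boots never help, the second power cycle does.
power_cycle_results = iter([False, True])
outcome = recover_host(
    reboot=lambda: False,
    power_cycle=lambda: next(power_cycle_results),
    max_reboots=3,
    max_power_cycles=2,
)
print(outcome)
```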
[0038] Upon a platform layer process failure, the 
failed process may be re-started zero or more times. In 
the preferred embodiment, re-startable platform layer 
processes may be re-started a pre-determined number
of times over a pre-determined time period, for instance 
three re-starts within five minutes. 
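
By way of non-limiting illustration, a limit such as three re-starts within five minutes may be enforced with a sliding window, sketched below in Python. The class name RestartBudget and the sliding-window technique are assumptions of this sketch; the description does not state how the limit is tracked.

```python
from collections import deque


class RestartBudget:
    """Permits at most max_restarts re-starts within any window_s seconds."""

    def __init__(self, max_restarts: int = 3, window_s: float = 300.0):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._times = deque()  # times of re-starts still inside the window

    def allow(self, now: float) -> bool:
        """Record and permit a re-start at time now, or refuse it."""
        # Discard re-starts that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) < self.max_restarts:
            self._times.append(now)
            return True
        return False  # budget exhausted: the caller should escalate


budget = RestartBudget(max_restarts=3, window_s=300.0)
decisions = [budget.allow(t) for t in (0.0, 10.0, 20.0, 30.0, 301.0)]
print(decisions)
```

A refused re-start is the point at which fault recovery would escalate to the next strategy.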
[0039] If re-starting the failed platform
layer process does not cure the failure, the process group 16
containing the failed process may be re-started zero or 
more times. In the preferred embodiment, re-startable 
platform layer process groups 16 may be re-started a 
pre-determined number of times over a pre-determined 
time period. When such a process group 16 is restarted,
it is restarted in its previous running state. 
[0040] If re-starting the failed platform layer process 
group 16 does not correct the platform layer process fail-
ure, fault recovery is escalated to: (1) taking all applica-
tion layer process groups out of service, including acti-
vating any process groups 20 or 22 on separate com-
puters that are in Standby 34 and that are paired
with an application layer process group 20 or 22 taken 
out of service, if any such paired Standby process 
groups 20 or 22 exist; and (2) re-initializing all platform 
layer process groups 16 on the computer hosting the 
failed platform layer process. If the failed platform layer 
process and the process group in which it is contained 
are both non-restartable, then this escalated fault recov- 
ery strategy is the initial fault recovery action taken. This 
escalated platform layer process group fault recovery 
procedure is also implemented upon detection of a fault 
in a resource depended upon by a platform layer proc- 
ess group 16, with the added step of re-initializing the 
resource for which a fault was detected. If re-initializing 
all platform layer process groups 16 on the computer 6 
hosting the failed platform layer process or resource 
("hosting computer") does not cure the failure, hosting 
computer 6 is re-booted, thereby causing all process 
groups 14, 16, 18, 20, and 22 on hosting computer 6 to 
be re-started. In the preferred embodiment, hosting
computer 6 can be re-booted a pre-determined number
of times over a pre-determined time period. If re-booting
hosting computer 6 does not cure the failure, independ-
ent computer 4 will then power cycle hosting computer
6, thereby re-booting hosting computer 6. In the pre-
ferred embodiment, hosting computer 6 can be power
cycled a pre-determined number of times over a pre-
determined time period. If power cycling hosting com-
puter 6 does not clear the platform layer proc-
ess group failure, independent computer 4 will cut off
power to hosting computer 6, which will remain in a pow-
ered down state.

[0041] Upon failure of a process in an application lay-
er process group 18, 20, or 22, the failed process may
be re-started zero or more times. In the preferred em-
bodiment, re-startable application layer processes may
be re-started a pre-determined number of times over a
pre-determined time period.

[0042] If re-starting the failed application
layer process does not cure the failure, the process group 18,
20 or 22 containing the failed process may be re-started
zero or more times. In the preferred embodiment, re-
startable application layer process groups 18, 20, and
22 may be re-started a pre-determined number of times 
over a pre-determined time period. When such a proc- 
ess group 18, 20 or 22 is restarted, it is restarted in its 
previous running state. 

[0043] If the failed application layer process group 18,
20, or 22 is an application layer process group 18, or a
primary or alternate process group 20 or 22 running
Active 36, and re-starting such a failed process group does
not correct the application layer process failure, fault
recovery is escalated to: taking such a failed application
layer process group out of service by transitioning it to
the Unavailable state 30, and, for such a primary or
alternate process group 20 or 22, activating its paired
standby process group 20 or 22 on a separate computer 6.
This escalated fault recovery strategy is the initial fault
recovery strategy for process faults of non-restartable
processes contained within non-restartable application
layer process groups 18, 20, and 22. This escalated fault
recovery strategy is also used upon detection of a fault
in a resource depended upon by an application layer
process group 18, 20, or 22 running in the Active state
36, with the added step of re-initializing the resource for
which a fault was detected.
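
The escalation described in paragraph [0043] can be pictured with a short sketch. It is offered only as an illustration under stated assumptions: the names State, ProcessGroup, and escalate_active_failure are hypothetical, and the enumeration values merely reuse the reference numerals 30, 34, and 36 for readability.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    # Values reuse the reference numerals of the states for readability.
    UNAVAILABLE = 30
    STANDBY = 34
    ACTIVE = 36


@dataclass
class ProcessGroup:
    name: str
    state: State
    # Paired group on a separate computer; None for an unpaired group 18.
    paired: Optional["ProcessGroup"] = None


def escalate_active_failure(group: ProcessGroup) -> None:
    """Take a failed Active group out of service and activate its standby."""
    group.state = State.UNAVAILABLE
    partner = group.paired
    if partner is not None and partner.state is State.STANDBY:
        partner.state = State.ACTIVE
```

An unpaired application layer process group 18 simply becomes Unavailable, whereas a primary or alternate process group 20 or 22 additionally hands its function to the paired standby.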

[0044] If an application layer process group 18, 20, or 22
cannot be taken out of service and transitioned to
Unavailable 30, fault recovery escalates to re-initializing
and re-starting all application layer and platform layer
process groups 16, 18, 20, and 22 on the computer 6
hosting the failed application layer process group or
resource ("computer hosting the application layer failure").
In the preferred embodiment, such re-initializations and
re-starts can be performed a pre-determined number of
times over a pre-determined period on the computer 6
hosting the application layer failure. If re-initializing
and re-starting all platform layer and application layer
process groups 16, 18, 20, and 22 on the computer hosting
the application layer failure does not cure the failure,
the computer hosting the application layer failure is
re-booted, thereby causing all process groups 14, 16, 18,
20, and 22 on the computer hosting the application layer
failure to be re-started. In the preferred embodiment, the
computer hosting the application layer failure can be
re-booted a pre-determined number of times over a
pre-determined time period. If re-booting the computer
hosting the application layer failure does not cure the
failure, independent computer 4 will then power cycle the
computer hosting the application layer failure, thereby
re-booting the computer hosting the application layer
failure. In the preferred embodiment, the computer hosting
the application layer failure can be power-cycled a
pre-determined number of times over a pre-determined time
period. If power cycling the computer hosting the
application layer failure does not clear the failure,
independent computer 4 will cut off power to the computer
hosting the application layer failure, which will remain in
a powered down state.
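
The escalation order of paragraph [0044] (re-initialize, re-boot, power cycle, power down) can be summarized as a ladder of bounded attempts. The following sketch is illustrative only: the step names, the attempt limits of two, and the recover function are hypothetical, and the pre-determined time period is omitted for brevity.

```python
from typing import Callable, List, Tuple

# (step name, attempts allowed), in the order described above.
# The limits are placeholders for the "pre-determined number of times".
LADDER: List[Tuple[str, int]] = [
    ("reinitialize_groups", 2),
    ("reboot", 2),
    ("power_cycle", 2),
]


def recover(attempt: Callable[[str], bool], ladder=LADDER) -> str:
    """Run each step up to its attempt limit.

    `attempt(step)` performs the step and returns True if the failure was
    cured. Returns the name of the step that cured the failure, or
    "power_down" if every step was exhausted without success.
    """
    for step, limit in ladder:
        for _ in range(limit):
            if attempt(step):
                return step
    return "power_down"
```

In this sketch the re-boot would be carried out by the hosting computer itself, while the power cycle and the final power down would be carried out by the independent computer 4.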

[0045] Application layer process groups 18, 20, and 
22 may, or may not, depend upon application layer proc- 
ess groups 18. Therefore, a fault in an application layer 
process group 18, 20 or 22 will not affect the availability 
of any system layer process groups 14 or any platform 
layer process groups 16. A fault in an application layer 
process group 18, 20 or 22 will also not affect the avail- 
ability of any application layer process groups 18, 20 
and 22 that are not dependent upon the failed applica- 
tion layer process group 18, 20, or 22. However, any 
application layer process groups 18, 20, and 22 that are
dependent upon an application layer process group 18
that is taken out of service will also be taken out of service.
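
By way of example only, the set of process groups affected when a depended-upon process group is taken out of service can be computed from a dependency table. The function name groups_to_take_out, the table layout, and the treatment of chained dependencies as transitive are assumptions made for this sketch; the specification states only that dependent process groups are also taken out of service.

```python
from typing import Dict, Set


def groups_to_take_out(failed: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the failed group plus every group that depends on it,
    directly or through a chain of dependencies."""
    out = {failed}
    changed = True
    while changed:
        changed = False
        for group, deps in depends_on.items():
            # A group is affected if any group it depends on is affected.
            if group not in out and deps & out:
                out.add(group)
                changed = True
    return out
```

Groups with no dependency on the failed group are absent from the result and remain available.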
[0046] If a failed primary or alternate process group 
20 or 22 is running Standby 34 and re-starting such a 
failed process group does not correct the failure, fault 
recovery is escalated to: taking such a failed Standby 
34 primary or alternate process group 20 or 22 out of 
service by transitioning it to the Unavailable state 30. 
Upon inability to take such a failed Standby 34 primary 
or alternate process group 20 or 22 out of service, the 
previously described escalated fault recovery strategy 
that is implemented upon inability to take any application 
layer process group 18, 20, or 22 out of service is im- 
plemented. 

[0047] Upon detection of a fault in a resource depended
upon by an application layer process group 20 or 22
running in the Standby state 34, or upon detection of a
process failure in a non-restartable process group 20 or 
22 running in the Standby state 34, the application layer 
process group 20 or 22 that is dependent upon the failed 
resource, or the process group 20 or 22 for which the 
failure was detected, is transitioned to the Unavailable 
state 30. Upon inability to take such a process group 20 
or 22 out of service, the previously described escalated 
fault recovery strategy that is implemented upon inability 
to take any application layer process group 18, 20, or 
22 out of service is implemented. 
[0048] The process group manager 10 can make the
state of individual process groups 14, 16, 18, 20, and
22 and critical resources known to either: (1) only those
computer systems 6 hosting the process groups 14, 16,
18, 20, and 22 to which the state information pertains,
or (2) all of the loosely coupled computing systems 6. In
addition, such state information can be made available
to external application software.

[0049] In the preferred embodiment, transitions be- 
tween process group states are controlled by process 
group management software 10 running on each of the 
loosely coupled computers 6. Nevertheless, such transitions
could be controlled by process group management
software 10 running on any permutation and/or
combination of the centralized computer 4 and the
loosely coupled computers 6 without departing from the
scope of this invention.

[0050] System layer process groups 14 can contain 
either: (1) operating system software and services found
in normal commercially available computing systems; or 
(2) process group management software 10. 
[0051] Resources depended upon by platform layer
process groups 16 are initialized before initializing
platform layer process groups 16. Failure to bring a platform
layer resource into service results in the same fault
recovery strategy as previously explained for a fault in a
resource depended upon by a platform layer process group 16,
which can escalate to the host computer 6 being re-booted.
[0052] During initialization, platform layer process
groups 16 and application layer process groups 18, 20,
and 22 are designated runnable only when all platform
layer resources and platform layer process groups 16
are designated runnable.

[0053] Platform layer process groups 16 handshake
with the process group manager 10 to control the startup
sequence of platform layer process groups 16. Similarly,
application layer process groups 18, 20, and 22 handshake
with the process group manager 10 to control the
startup sequence of application layer process groups
18, 20, and 22.

[0054] Application layer process groups 18, 20, and
22 can be put into the Off-line state 38 individually so
that maintenance or software updates can be performed
on the Off-line process groups 18, 20, and 22 without
impacting other process groups on the computer 6
hosting the Off-line process groups 18, 20, and 22.
[0055] Primary and alternate process groups 20 and
22 within the same process-group pair 24 may have a
shared resource dependency. Process-group pairs 24
that have an active cold-standby paired relationship
typically provide high availability. Process-group pairs 24
that have an active hot-standby paired relationship
typically provide very high availability.
[0056] It will be obvious to those having ordinary skill
in the art that primary and alternate process groups
could be arranged in a lead-active/active paired
relationship, analogous to the Active 36/Standby 34
relationship described above, without departing from the
scope of this invention. Such lead-active/active
process-group pairs typically provide ultra-high availability.
[0057] Application layer resources on a single computing
system are initialized before initializing primary
or alternate process groups 20 or 22. Failure to bring a
critical application layer resource into service results in
taking the dependent application layer process group
18, 20, or 22 out of service by transitioning it to the
Unavailable state 30, and activating the paired process
group 20 or 22 of the dependent process group 20 or 22, if
one exists, on a separate computer 6.
[0058] Activated platform layer and application layer 
process groups 16, 18, 20, and 22 handshake with the 
process group manager 10 to acknowledge activation 
to Active 36 or Standby 34. 

[0059] Process-group pairs 24 can be provided only 
when platform layer process groups 16 are operating 
normally on the computers 6 hosting both the primary 
process group 20 and the alternate process group 22 
that are to be contained in the process-group pair 24.
[0060] Upon a primary or alternate process group's 
20 or 22 failure to acknowledge initiation of activation 
by the process group manager 10, the primary or alter- 
nate process group 20 or 22 for which activation was 
initiated is transitioned to the Unavailable state 30. Sim- 
ilarly, upon an application layer primary or alternate 
process group's 20 or 22 failure to acknowledge initia- 
tion of de-activation from Active 36 to Standby 34, the 
primary or alternate process group 20 or 22 for which 
de-activation was initiated is transitioned to the Unavail- 
able state 30. 

[0061] Although initiated, activation of a standby process
group 20 or 22 does not occur until the Standby
process group 20 or 22 for which activation has been 
initiated acknowledges initiation of the activation by 
handshaking with the process group manager 10. 
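
The acknowledgment rule of paragraphs [0060] and [0061] can be sketched as follows, purely for illustration. The use of a timeout to decide that an acknowledgment has failed is an assumption of this sketch, since the specification does not state how a missing acknowledgment is detected; the names State and resolve_activation are likewise hypothetical.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    # Values reuse the reference numerals of the states for readability.
    UNAVAILABLE = 30
    STANDBY = 34
    ACTIVE = 36


def resolve_activation(
    initiated_at: float, ack_at: Optional[float], timeout: float
) -> State:
    """Decide the outcome of an initiated Standby-to-Active transition.

    The group becomes Active only once it has acknowledged the initiation;
    a missing or late acknowledgment (an assumed timeout rule) results in
    the Unavailable state.
    """
    if ack_at is not None and ack_at - initiated_at <= timeout:
        return State.ACTIVE
    return State.UNAVAILABLE
```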
[0062] Primary and alternate process groups 20 and 
22 belonging to the same process-group pair 24 can be 
put in the Off-line state 38 for maintenance without im- 
pacting other process groups 18, 20, and 22, on any of 
the loosely coupled computers 6. 
[0063] Separate application layer process groups 18,
20, and 22, running on the same computer 6, or on
different computers 6, can host dissimilar external
applications, such as one or more application layer process
groups 18, 20, or 22 controlling Code Division Multiple
Access (CDMA) cellular telephone call processing, one
or more process groups 18, 20, or 22 controlling Time
Division Multiple Access (TDMA) cellular telephone call
processing, one or more process groups 18, 20, or 22
controlling Group Special Mobile (GSM) cellular telephone
call processing, one or more process groups 18,
20, or 22 controlling Cellular Digital Packet Data (CDPD)
cellular telephone call processing, and one or more
process groups 18, 20, or 22 controlling Analog Mobile
Phone Service (AMPS) cellular telephone call processing.
Similarly, separate application layer process groups
18, 20, and 22, running on the same computer 6, or on
different computers 6, can host dissimilar external
applications.



[0064] As will be obvious to those having ordinary skill
in the art, this invention provides the flexibility to
configure one or more process-group pairs 24 across two or
more loosely coupled computers 6 in several high
availability computing element 6 configurations including
active/standby, active/active, and a typical n+k sparing
arrangement.



Claims

1. A method for providing high availability applications
comprising, in combination, the steps of:

A. running, on at least two computers, one or
more process groups, at least one of said process
groups containing one or more processes
that have a fault recovery strategy common to
said at least one of said process groups;

B. running, on at least one of said at least two
computers, a process group manager that initiates
a fault recovery strategy for at least one
of said one or more process groups;

C. providing, on at least a first of said at least
two computers, a system layer having at least
one process group;

D. taking all of said process groups out of service
on said first computer upon a system layer
process group fault occurring on said first
computer;

E. re-booting said first computer upon a system
layer process group fault occurring on said first
computer;

F. providing at least one paired process group
on a second of said at least two computers, said
paired process group being paired with one of
said one or more process groups on said first
computer; and

G. activating said paired process group on said
second computer upon a system layer process
group fault occurring on said first computer.

2. The method of claim 1 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-booting of said first computer to cure said
system layer process group fault.

3. The method of claim 2 further comprising, in
combination, the step of: using an independent computer
to power down said first computer upon failure
of said power cycling of said first computer to cure
said system layer process group fault.

4. A method for providing high availability applications
comprising, in combination, the steps of:

A. running, on at least two computers, one or
more process groups, at least one of said process
groups containing one or more processes
that have a fault recovery strategy common to
said at least one of said process groups;

B. running, on at least one of said at least two
computers, a process group manager that initiates
a fault recovery strategy for at least one
of said one or more process groups;

C. providing, on at least a first of said at least
two computers, a system layer having at least
one process group;

D. taking all of said process groups out of service
on said first computer upon a fault in a resource
depended upon by at least one of said
system layer process groups on said first
computer;

E. re-booting said first computer upon said fault
in said resource depended upon by at least one
of said system layer process groups on said
first computer;

F. providing at least one paired process group
on a second of said at least two computers, said
paired process group being paired with one of
said process groups on said first computer; and

G. activating said paired process group on said
second computer upon said fault in said resource
depended upon by at least one of said
system layer process groups on said first
computer.

5. The method of claim 4 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-booting of said first computer to cure said
fault in said resource depended upon by at least one
of said system layer process groups on said first
computer.

6. The method of claim 5 further comprising, in
combination, the step of: using an independent computer
to power down said first computer upon failure
of said power cycling of said first computer to cure
said fault in said resource depended upon by at
least one of said system layer process groups on
said first computer.



7. A method for providing high availability applications
comprising, in combination, the steps of:

A. running, on at least two computers, one or 
more process groups, at least one of said proc- 
ess groups containing one or more processes 
that have a fault recovery strategy common to 
said at least one of said process groups; 

B. running, on at least one of said at least two 
computers, a process group manager that ini- 
tiates a fault recovery strategy for at least one 
of said one or more process groups; 

C. providing, on at least a first of said at least 
two computers, a system layer having at least 
one process group; 

D. taking all of said process groups out of serv- 
ice on said first computer upon a system layer 
process group fault occurring on said first com- 
puter; 

E. re-booting said first computer upon a system 
layer process group fault occurring on said first 
computer; 

F. providing, on at least said first computer, a
platform layer having at least one process
group;

G. taking all of said process groups, except 
each of said at least one process group in said 
system layer, out of service on said first com- 
puter upon a platform layer process group fault 
occurring on said first computer; 

H. providing at least one paired process group
on a second of said at least two computers, said
paired process group being paired with one of
said process groups on said first computer; and

I. activating said paired process group on said 
second computer upon said platform layer 
process group fault occurring on said first com- 
puter. 

8. The method of claim 7 further comprising, in
combination, the step of: re-initializing all of said process
groups, except each of said at least one process 
group in said system layer, on said first computer 
upon said platform layer process group fault occur- 
ring on said first computer. 

9. The method of claim 8 further comprising, in com- 
bination, the step of: re-booting said first computer 
upon failure of said re-initialization to cure said plat- 
form layer process group fault. 
10. The method of claim 9 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-booting of said first computer to cure said
platform layer process group fault.

11. The method of claim 10 further comprising, in
combination, the step of: using an independent computer
to power down said first computer upon failure
of said power cycling of said first computer to cure
said platform layer process group fault.

12. A method for providing high availability applications
comprising, in combination, the steps of:

A. running, on at least two computers, one or
more process groups, at least one of said process
groups containing one or more processes
that have a fault recovery strategy common to
said at least one of said process groups;

B. running, on at least one of said at least two
computers, a process group manager that initiates
a fault recovery strategy for at least one
of said one or more process groups;

C. providing, on at least a first of said at least
two computers, a system layer having at least
one process group;

D. taking all of said process groups out of service
on said first computer upon a system layer
process group fault occurring on said first
computer;

E. re-booting said first computer upon a system
layer process group fault occurring on said first
computer;

F. providing, on at least a first of said at
least two computers, a platform layer having at
least one process group;

G. taking all of said process groups, except
each of said at least one process group in said
system layer, out of service on said first computer
upon a fault in a resource depended upon
by at least one of said platform layer process
groups on said first computer;

H. providing at least one paired process group
on a second of said at least two computers, said
at least one paired process group being paired
with one of said process groups on said first
computer; and

I. activating said at least one paired process
group on said second computer upon said fault
in said resource depended upon by at least one
of said platform layer process groups on said
first computer.

13. The method of claim 12 further comprising, in
combination, the steps of:

A. re-initializing said resource having said fault;
and

B. re-initializing all of said process groups, except
each of said at least one process group in
said system layer, on said first computer upon
said fault in said resource depended upon by
at least one of said platform layer process
groups on said first computer.

14. The method of claim 13 further comprising, in
combination, the step of: re-booting said first computer
upon failure of said re-initializations to cure said
fault in said resource depended upon by at least one
of said platform layer process groups on said first
computer.

15. The method of claim 14 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-booting of said first computer to cure said
fault in said resource depended upon by at least one
of said platform layer process groups on said first
computer.

16. The method of claim 15 further comprising, in
combination, the step of: using an independent computer
to power down said first computer upon failure
of said power cycling of said first computer to cure
said fault in said resource depended upon by at
least one of said platform layer process groups on
said first computer.

17. A method for providing high availability applications
comprising, in combination, the steps of:

A. running, on at least two computers, one or
more process groups, at least one of said process
groups containing one or more processes
that have a fault recovery strategy common to
said at least one of said process groups;

B. running, on at least one of said at least two
computers, a process group manager that initiates
a fault recovery strategy for at least one
of said one or more process groups;

C. providing, on at least a first of said at least
two computers, a system layer having at least
one process group;

D. taking all of said process groups out of service
on said first computer upon a system layer
process group fault occurring on said first
computer;

E. re-booting said first computer upon a system
layer process group fault occurring on said first
computer;

F. providing, on at least a first of said at least
two computers, a platform layer having at least
one process group;

G. restarting at least one of said platform layer
process groups upon a platform layer process
group fault occurring on said first computer;

H. taking all of said process groups, except
each of said at least one process group in said
system layer, out of service on said first computer
upon failure of said re-start to cure said
platform layer process group fault;

I. providing at least one paired process group
on a second of said at least two computers, said
at least one paired process group being paired
with one of said process groups on said first
computer; and

J. activating said at least one paired process
group on said second computer upon failure of
said re-start to cure said platform layer process
group fault.

18. The method of claim 17 further comprising, in
combination, the step of: re-initializing all of said process
groups, except each of said at least one process
group in said system layer, on said first computer
upon failure of said re-start to cure said platform
layer process group fault.

19. The method of claim 18 further comprising, in
combination, the step of: re-booting said first computer
upon failure of said re-initialization to cure said
platform layer process group fault.

20. The method of claim 19 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-booting of said first computer to cure said
platform layer process group fault.

21. The method of claim 20 further comprising, in
combination, the step of: using an independent computer
to power down said first computer upon failure
of said power cycling of said first computer to cure
said platform layer process group fault.



22. A method for providing high availability applications 
comprising, in combination, the steps of: 

A. running, on at least two computers, one or 
more process groups, at least one of said proc- 
ess groups containing one or more processes 
that have a fault recovery strategy common to 
said at least one of said process groups; 

B. running, on at least one of said at least two 
computers, a process group manager that ini- 
tiates a fault recovery strategy for at least one 
of said one or more process groups; 

C. providing, on at least said first computer, a 
system layer having at least one process group; 

D. taking all of said process groups out of serv- 
ice on said first computer upon a system layer 
process group fault occurring on said first com- 
puter; 

E. re-booting said first computer upon a system 
layer process group fault occurring on said first 
computer; 

F. providing, on at least a first of said at least 
two computers, an application layer having at 
least one process group; 

G. taking said at least one application layer 
process group out of service on said first com- 
puter upon a fault in said at least one applica- 
tion layer process group on said first computer; 

H. providing at least one paired process group 
on a second of said at least two computers, said 
paired process group being paired with one of 
said at least one application layer process 
group taken out of service on said first compu- 
ter; 

I. activating said paired process group on said 
second computer upon said fault in said at least 
one application layer process group on said first 
computer; and 

J. re-initializing all of said process groups, ex- 
cept each of said at least one process group in 
said system layer, on said first computer upon 
not being able to take said application layer 
process group having said fault out of service 
on said first computer. 

23. The method of claim 22 further comprising, in com- 
bination, the step of: re-booting said first computer 
upon failure of said re-initialization to cure said ap- 
plication layer process group fault. 
24. The method of claim 23 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-boot of said first computer to cure said
application layer process group fault.

25. The method of claim 24 further comprising, in com- 
bination, the step of: using an independent compu- 
ter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said application layer process group fault. 

26. A method for providing high availability applications 
comprising, in combination, the steps of: 

A. running, on at least two computers, one or 
more process groups, at least one of said proc- 
ess groups containing one or more processes 
that have a fault recovery strategy common to 
said at least one of said process groups; 

B. running, on at least one of said at least two 
computers, a process group manager that ini- 
tiates a fault recovery strategy for at least one 
of said one or more process groups; 

C. providing, on at least a first of said at least 
two computers, an application layer having at 
least two process groups; 

D. defining a dependency by at least a first of 
said at least two application layer process 
groups upon at least a second of said at least 
two application layer process groups; 

E. taking said first and said second application 
layer process groups out of service on said first 
computer upon a fault in said second applica- 
tion layer process group on said first computer; 

F. providing at least one paired process group
on a second of said at least two computers, said
paired process group being paired with said
second application layer process group on said
first computer; and

G. activating said paired process group on said 
second computer upon said fault in said second 
application layer process group on said first 
computer. 

27. The method of claim 26 further comprising, in com- 
bination, the steps of: 

A. providing, on at least said first computer, a 
system layer having at least one process group; 

B. taking all of said process groups out of service
on said first computer upon a system layer
process group fault occurring on said first
computer;

C. re-booting said first computer upon a system
layer process group fault occurring on said first
computer; and

D. re-initializing all of said process groups, except
each of said at least one process group in
said system layer, on said first computer upon
not being able to take said first application layer
process group out of service on said first
computer.

28. The method of claim 27 further comprising, in
combination, the step of: re-booting said first computer
upon failure of said re-initialization to cure said fault
in said second application layer process group.

29. The method of claim 28 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-boot of said first computer to cure said
fault in said second application layer process group.

30. The method of claim 29 further comprising, in
combination, the step of: using an independent computer
to power down said first computer upon failure
of said power cycling of said first computer to cure
said fault in said second application layer process
group.

31. A method for providing high availability applications
comprising, in combination, the steps of:

A. running, on at least two computers, one or
more process groups, at least one of said process
groups containing one or more processes
that have a fault recovery strategy common to
said at least one of said process groups;

B. running, on at least one of said at least two
computers, a process group manager that initiates
a fault recovery strategy for at least one
of said one or more process groups;

C. providing, on at least a first of said at least
two computers, an application layer having at
least one process group;

D. taking said at least one application layer
process group out of service on said first computer
upon a fault in a resource depended upon
by said at least one application layer process
group;

E. providing at least one paired process group
on a second of said at least two computers, said
paired process group being paired with one of
said at least one application layer process
group on said first computer; and

F. activating said paired process group on said
second computer upon said fault in said resource
depended upon by said at least one
application layer process group.

32. The method of claim 31 further comprising, in
combination, the steps of:

A. providing, on at least said first computer, a
system layer having at least one process group;

B. taking all of said process groups out of service
on said first computer upon a system layer
process group fault occurring on said first
computer;

C. re-booting said first computer upon a system
layer process group fault occurring on said first
computer;

D. re-initializing said resource having said fault;
and

E. re-initializing all of said process groups, except
each of said at least one process group in
said system layer, on said first computer upon
not being able to take said at least one application
layer process group out of service on said
first computer.

33. The method of claim 32 further comprising, in
combination, the step of: re-booting said first computer
upon failure of said re-initializations to cure said
fault in said resource depended upon by said at
least one application layer process group.

34. The method of claim 33 further comprising, in
combination, the step of: using an independent computer
to power cycle said first computer upon failure
of said re-boot of said first computer to cure said
fault in said resource depended upon by said at
least one application layer process group.

35. The method of claim 34 further comprising, in com- 
bination, the step of: using an independent compu- so 
ter to power down said first computer upon failure 

of said power cycling of said first computer to cure 
said fault in said resource depended upon by said 
at least one application layer process group. 
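
By way of illustration only, and forming no part of the claims, the escalating recovery recited in claims 32 to 35 (re-initialization, then re-boot, then power cycle, then power down) may be sketched as follows. The function name `escalate`, the step labels, and the `fault_cured` callback are assumptions introduced for this sketch; the sketch does not model which computer performs each step.

```python
from typing import Callable, Dict, List

# Escalation order taken from claims 32 to 35; the step names are
# illustrative labels, not identifiers defined by the patent.
ESCALATION_ORDER = ["re-initialize", "re-boot", "power cycle", "power down"]


def escalate(actions: Dict[str, Callable[[], None]],
             fault_cured: Callable[[], bool]) -> List[str]:
    """Apply recovery actions in escalating order until the fault is cured.

    Returns the names of the steps that were attempted, in order.
    """
    attempted: List[str] = []
    for step in ESCALATION_ORDER:
        actions[step]()
        attempted.append(step)
        # Powering down is the last resort; nothing follows it.
        if step == "power down" or fault_cured():
            break
    return attempted
```

Each step is attempted only if the preceding step failed to cure the fault, which mirrors the "upon failure of said ..." wording of the dependent claims.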

36. A method for providing high availability applications 
comprising, in combination, the steps of: 

A. running, on at least two computers, one or 
more process groups, at least one of said process 
groups containing one or more processes 
that have a fault recovery strategy common to 
said at least one of said process groups; 

B. running, on at least one of said at least two 
computers, a process group manager that ini- 
tiates a fault recovery strategy for at least one 
of said one or more process groups; 

C. providing, on at least said first computer, a 
system layer having at least one process group; 

D. taking all of said process groups out of serv- 
ice on said first computer upon a system layer 
process group fault occurring on said first com- 
puter; 

E. re-booting said first computer upon a system 
layer process group fault occurring on said first 
computer; 

F. providing, on at least a first of said at least 
two computers, an application layer having at 
least one process group; 

G. re-starting said at least one application layer 
process group on said first computer upon a 
fault in said at least one application layer proc- 
ess group; 

H. taking said at least one application layer 
process group out of service on said first computer 
upon failure of said re-start to cure said 
fault in said at least one application layer process 
group; 

I. providing at least one paired process group 
on a second of said at least two computers, said 
paired process group being paired with one of 
said at least one application layer process 
group taken out of service on said first compu- 
ter; 

J. activating said paired process group on said 
second computer upon failure of said re-start 
to cure said fault in said at least one application 
layer process group taken out of service on said 
first computer; and 

K. re-initializing all of said process groups, ex- 
cept each of said at least one process group in 
said system layer, on said first computer upon 
not being able to take said application layer 
process group having said fault out of service 
on said first computer. 

37. The method of claim 36 further comprising, in combination, 
the step of: re-booting said first computer 
upon failure of said re-initialization to cure said 
application layer process group fault. 

38. The method of claim 37 further comprising, in com- 
bination, the step of: using an independent compu- 
ter to power cycle said first computer upon failure 
of said re-boot of said first computer to cure said 
application layer process group fault. 

39. The method of claim 38 further comprising, in com- 
bination, the step of: using an independent compu- 
ter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said application layer process group fault. 

40. A method for providing high availability applications 
comprising, in combination, the steps of: 

A. running, on at least two computers, one or 
more process groups, at least one of said proc- 
ess groups containing one or more processes 
that have a fault recovery strategy common to 
said at least one of said process groups; 

B. running, on at least one of said at least two 
computers, a process group manager that ini- 
tiates a fault recovery strategy for at least one 
of said one or more process groups; 

C. providing, on at least a first of said at least 
two computers, an application layer having at 
least two process groups; 

D. defining a dependency by at least a first of 
said at least two application layer process 
groups upon at least a second of said at least 
two application layer process groups; 

E. re-starting said second application layer 
process group on said first computer upon a 
fault in said second application layer process 
group; 

F. taking said first and said second application 
layer process groups out of service on said first 
computer upon failure of said re-start to cure 
said fault in said second application layer proc- 
ess group; 

G. providing at least one paired process group 
on a second of said at least two computers, said 
paired process group being paired with said first 
application layer process group on said first 
computer; and 

H. activating said at least one paired process 
group on said second computer upon failure of 
said re-start to cure said fault in said second 
application layer process group. 

41. The method of claim 40 further comprising, in combination, 
the steps of: 

A. providing, on at least said first computer, a 
system layer having at least one process group; 

B. taking all of said process groups out of service 
on said first computer upon a system layer 
process group fault occurring on said first computer; 

C. re-booting said first computer upon a system 
layer process group fault occurring on said first 
computer; and 

D. re-initializing all of said process groups, except 
each of said at least one process group in 
said system layer, on said first computer upon 
not being able to take said first application layer 
process group out of service on said first computer. 

42. The method of claim 41 further comprising, in combination, 
the step of: re-booting said first computer 
upon failure of said re-initialization to cure said fault 
in said second application layer process group. 

43. The method of claim 42 further comprising, in combination, 
the step of: using an independent computer 
to power cycle said first computer upon failure 
of said re-boot of said first computer to cure said 
fault in said second application layer process group. 

44. The method of claim 43 further comprising, in combination, 
the step of: using an independent computer 
to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said fault in said second application layer process 
group. 
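
By way of illustration only, and forming no part of the claims, the dependency handling recited in claim 40 may be sketched as follows. The `ProcessGroup` type, its field names, and the `restart` callback are assumptions introduced for this sketch, and only direct dependencies are considered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ProcessGroup:
    name: str
    depends_on: List[str] = field(default_factory=list)
    paired_with: Optional[str] = None  # hypothetical name of the paired group on the second computer
    in_service: bool = True


def handle_fault(groups: Dict[str, ProcessGroup], faulted: str,
                 restart: Callable[[str], bool]) -> List[str]:
    """Return the paired groups to activate on the second computer."""
    if restart(faulted):
        return []  # the re-start cured the fault; nothing is taken out of service
    # Take the faulted group and every group directly depending on it out of service.
    affected = [faulted] + [g.name for g in groups.values()
                            if faulted in g.depends_on]
    to_activate: List[str] = []
    for name in affected:
        groups[name].in_service = False
        pair = groups[name].paired_with
        if pair is not None:
            to_activate.append(pair)
    return to_activate
```

In this sketch a fault in the second (depended-upon) group that survives a re-start removes both groups from service on the first computer, and the group paired with the first (dependent) group is returned for activation on the second computer.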

45. An apparatus for providing high availability applications 
comprising, in combination: 

A. means for running, on at least two computers, 
one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, a system layer having 
at least one process group; 

D. means for taking all of said process groups 
out of service on said first computer upon a sys- 
tem layer process group fault occurring on said 
first computer; 

E. means for re-booting said first computer up- 
on a system layer process group fault occurring 
on said first computer; 

F. means for providing at least one paired process 
group on a second of said at least two computers, 
said paired process group being paired 
with one of said one or more process groups 
on said first computer; and 

G. means for activating said paired process 
group on said second computer upon a system 
layer process group fault occurring on said first 
computer. 

46. The apparatus of claim 45 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-booting of said first computer to cure said 
system layer process group fault. 

47. The apparatus of claim 46 further comprising, in 
combination: means for using an independent com- 
puter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said system layer process group fault. 

48. An apparatus for providing high availability applica- 
tions comprising, in combination: 

A. means for running, on at least two comput- 
ers, one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, a system layer having 
at least one process group; 

D. means for taking all of said process groups 
out of service on said first computer upon a fault 
in a resource depended upon by at least one of 
said system layer process groups on said first 
computer; 

E. means for re-booting said first computer up- 
on said fault in said resource depended upon 
by at least one of said system layer process 
groups on said first computer; 

F. means for providing at least one paired proc- 
ess group on a second of said at least two com- 
puters, said paired process group being paired 
with one of said process groups on said first 
computer; and 

G. means for activating said paired process 
group on said second computer upon said fault 
in said resource depended upon by at least one 
of said system layer process groups on said 
first computer. 

49. The apparatus of claim 48 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-booting of said first computer to cure said 
fault in said resource depended upon by at least one 
of said system layer process groups on said first 
computer. 

50. The apparatus of claim 49 further comprising, in 
combination: means for using an independent computer 
to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said fault in said resource depended upon by at 
least one of said system layer process groups on 
said first computer. 

51. An apparatus for providing high availability applications 
comprising, in combination: 

A. means for running, on at least two computers, 
one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, a system layer having 
at least one process group; 

D. means for taking all of said process groups 
out of service on said first computer upon a system 
layer process group fault occurring on said 
first computer; 

E. means for re-booting said first computer up- 
on a system layer process group fault occurring 
on said first computer; 

F. means for providing, on at least said first 
computer, a platform layer having at least one 
process group; 

G. means for taking all of said process groups, 
except each of said at least one process group 
in said system layer, out of service on said first 
computer upon a platform layer process group 
fault occurring on said first computer; 

H. means for providing at least one paired proc- 
ess group on a second of said at least two com- 
puters, said paired process group being paired 
with one of said process groups on said first 
computer; and 

I. means for activating said paired process 
group on said second computer upon said plat- 
form layer process group fault occurring on said 
first computer. 

52. The apparatus of claim 51 further comprising, in 
combination: means for re-initializing all of said 
process groups, except each of said at least one 
process group in said system layer, on said first 
computer upon said platform layer process group 
fault occurring on said first computer. 

53. The apparatus of claim 52 further comprising, in 
combination: means for re-booting said first compu- 
ter upon failure of said re-initialization to cure said 
platform layer process group fault. 

54. The apparatus of claim 53 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-booting of said first computer to cure said 
platform layer process group fault. 

55. The apparatus of claim 54 further comprising, in 
combination: means for using an independent com- 
puter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said platform layer process group fault. 
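
By way of illustration only, and forming no part of the claims, the layer-dependent scope of recovery recited in claims 45 and 51 may be sketched as follows. The `Layer` enumeration and the function `groups_taken_out` are assumptions introduced for this sketch; dependencies between application layer process groups are not modeled.

```python
from enum import Enum
from typing import Dict, Set


class Layer(Enum):
    SYSTEM = 1
    PLATFORM = 2
    APPLICATION = 3


def groups_taken_out(groups: Dict[str, Layer], faulted: str) -> Set[str]:
    """Return the process groups taken out of service for a fault in `faulted`."""
    layer = groups[faulted]
    if layer is Layer.SYSTEM:
        # System layer fault: all process groups on the computer.
        return set(groups)
    if layer is Layer.PLATFORM:
        # Platform layer fault: all process groups except the system layer.
        return {name for name, l in groups.items() if l is not Layer.SYSTEM}
    # Application layer fault: only the faulted process group.
    return {faulted}
```

The lower the layer in which the fault occurs, the wider the set of process groups affected on the first computer.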

56. An apparatus for providing high availability applications 
comprising, in combination: 

A. means for running, on at least two computers, 
one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, a system layer having 
at least one process group; 

D. means for taking all of said process groups 
out of service on said first computer upon a system 
layer process group fault occurring on said 
first computer; 

E. means for re-booting said first computer upon 
a system layer process group fault occurring 
on said first computer; 

F. means for providing, on at least a first of said 
at least two computers, a platform layer having 
at least one process group; 

G. means for taking all of said process groups, 
except each of said at least one process group 
in said system layer, out of service on said first 
computer upon a fault in a resource depended 
upon by at least one of said platform layer process 
groups on said first computer; 

H. means for providing at least one paired process 
group on a second of said at least two computers, 
said at least one paired process group 
being paired with one of said process groups 
on said first computer; and 

I. means for activating said at least one paired 
process group on said second computer upon 
said fault in said resource depended upon by 
at least one of said platform layer process 
groups on said first computer. 

57. The apparatus of claim 56 further comprising, in 
combination: 

A. means for re-initializing said resource having 
said fault; and 

B. means for re-initializing all of said process 
groups, except each of said at least one process 
group in said system layer, on said first 
computer upon said fault in said resource depended 
upon by at least one of said platform 
layer process groups on said first computer. 

58. The apparatus of claim 57 further comprising, in 
combination: means for re-booting said first computer 
upon failure of said re-initializations to cure said 
fault in said resource depended upon by at least one 
of said platform layer process groups on said first 
computer. 

59. The apparatus of claim 58 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-booting of said first computer to cure said 
fault in said resource depended upon by at least one 
of said platform layer process groups on said first 
computer. 

60. The apparatus of claim 59 further comprising, in 
combination: means for using an independent computer 
to power down said first computer upon failure 
of said power cycling of said first computer to 
cure said fault in said resource depended upon by 
at least one of said platform layer process groups 
on said first computer. 

61. An apparatus for providing high availability applications 
comprising, in combination: 

A. means for running, on at least two comput- 
ers, one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, a system layer having 
at least one process group; 

D. means for taking all of said process groups 
out of service on said first computer upon a system 
layer process group fault occurring on said 
first computer; 

E. means for re-booting said first computer upon 
a system layer process group fault occurring 
on said first computer; 

F. means for providing, on at least a first of said 
at least two computers, a platform layer having 
at least one process group; 

G. means for restarting at least one of said platform 
layer process groups upon a platform layer 
process group fault occurring on said first 
computer; 

H. means for taking all of said process groups, 
except each of said at least one process group 
in said system layer, out of service on said first 
computer upon failure of said re-start to cure 
said platform layer process group fault; 

I. means for providing at least one paired proc- 
ess group on a second of said at least two com- 
puters, said at least one paired process group 
being paired with one of said process groups 
on said first computer; and 

J. means for activating said at least one paired 
process group on said second computer upon 
failure of said re-start to cure said platform layer 
process group fault. 

62. The apparatus of claim 61 further comprising, in 
combination: means for re-initializing all of said 
process groups, except each of said at least one 
process group in said system layer, on said first 
computer upon failure of said re-start to cure said 
platform layer process group fault. 

63. The apparatus of claim 62 further comprising, in 
combination: means for re-booting said first computer 
upon failure of said re-initialization to cure said 
platform layer process group fault. 

64. The apparatus of claim 63 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-booting of said first computer to cure said 
platform layer process group fault. 

65. The apparatus of claim 64 further comprising, in 
combination: means for using an independent com- 
puter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said platform layer process group fault. 

66. An apparatus for providing high availability applica- 
tions comprising, in combination: 

A. means for running, on at least two comput- 
ers, one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least said first 
computer, a system layer having at least one 
process group; 

D. means for taking all of said process groups 
out of service on said first computer upon a sys- 
tem layer process group fault occurring on said 
first computer; 

E. means for re-booting said first computer up- 
on a system layer process group fault occurring 
on said first computer; 

F. means for providing, on at least a first of said 
at least two computers, an application layer 
having at least one process group; 

G. means for taking said at least one applica- 
tion layer process group out of service on said 
first computer upon a fault in said at least one 
application layer process group on said first 
computer; 

H. means for providing at least one paired proc- 
ess group on a second of said at least two com- 
puters, said paired process group being paired 
with one of said at least one application layer 
process group taken out of service on said first 
computer; 

I. means for activating said paired process 
group on said second computer upon said fault 
in said at least one application layer process 
group on said first computer; and 

J. means for re-initializing all of said process 
groups, except each of said at least one proc- 
ess group in said system layer, on said first 
computer upon not being able to take said ap- 
plication layer process group having said fault 
out of service on said first computer. 

67. The apparatus of claim 66 further comprising, in 
combination: means for re-booting said first compu- 
ter upon failure of said re-initialization to cure said 
application layer process group fault. 

68. The apparatus of claim 67 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-boot of said first computer to cure said 
application layer process group fault. 

69. The apparatus of claim 68 further comprising, in 
combination: means for using an independent com- 
puter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said application layer process group fault. 

70. An apparatus for providing high availability applications 
comprising, in combination: 

A. means for running, on at least two computers, 
one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, an application layer 
having at least two process groups; 

D. means for defining a dependency by at least 
a first of said at least two application layer proc- 
ess groups upon at least a second of said at 
least two application layer process groups; 

E. means for taking said first and said second 
application layer process groups out of service 
on said first computer upon a fault in said sec- 
ond application layer process group on said first 
computer; 

F. means for providing at least one paired process 
group on a second of said at least two computers, 
said paired process group being paired 
with said second application layer process 
group on said first computer; and 

G. means for activating said paired process 
group on said second computer upon said fault 
in said second application layer process group 
on said first computer. 

71. The apparatus of claim 70 further comprising, in 
combination: 

A. means for providing, on at least said first 
computer, a system layer having at least one 
process group; 

B. means for taking all of said process groups 
out of service on said first computer upon a system 
layer process group fault occurring on said 
first computer; 

C. means for re-booting said first computer upon 
a system layer process group fault occurring 
on said first computer; and 

D. means for re-initializing all of said process 
groups, except each of said at least one process 
group in said system layer, on said first 
computer upon not being able to take said first 
application layer process group out of service 
on said first computer. 

72. The apparatus of claim 71 further comprising, in 
combination: means for re-booting said first compu- 
ter upon failure of said re-initialization to cure said 
fault in said second application layer process group. 

73. The apparatus of claim 72 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-boot of said first computer to cure said 
fault in said second application layer process group. 

74. The apparatus of claim 73 further comprising, in 
combination: means for using an independent com- 
puter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said fault in said second application layer process 
group. 

75. An apparatus for providing high availability applica- 
tions comprising, in combination: 

A. means for running, on at least two comput- 
ers, one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, an application layer 
having at least one process group; 

D. means for taking said at least one application 
layer process group out of service on said first 
computer upon a fault in a resource depended 
upon by said at least one application layer proc- 
ess group; 

E. means for providing at least one paired proc- 
ess group on a second of said at least two com- 
puters, said paired process group being paired 
with one of said at least one application layer 
process group on said first computer; and 

F. means for activating said paired process 
group on said second computer upon said fault 
in said resource depended upon by said at least 
one application layer process group. 

76. The apparatus of claim 75 further comprising, in 
combination: 

A. means for providing, on at least said first 
computer, a system layer having at least one 
process group; 

B. means for taking all of said process groups 
out of service on said first computer upon a sys- 
tem layer process group fault occurring on said 
first computer; 

C. means for re-booting said first computer up- 
on a system layer process group fault occurring 
on said first computer; 

D. means for re-initializing said resource hav- 
ing said fault; and 

E. means for re-initializing all of said process 
groups, except each of said at least one proc- 
ess group in said system layer, on said first 
computer upon not being able to take said at 
least one application layer process group out 
of service on said first computer. 

77. The apparatus of claim 76 further comprising, in 
combination: means for re-booting said first compu- 
ter upon failure of said re-initializations to cure said 
fault in said resource depended upon by said at 
least one application layer process group. 

78. The apparatus of claim 77 further comprising, in 
combination: means for using an independent computer 
to power cycle said first computer upon failure 
of said re-boot of said first computer to cure said 
fault in said resource depended upon by said at 
least one application layer process group. 

79. The apparatus of claim 78 further comprising, in 
combination: means for using an independent computer 
to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said fault in said resource depended upon by said 
at least one application layer process group. 

80. An apparatus for providing high availability applica- 
tions comprising, in combination: 

A. means for running, on at least two computers, 
one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least said first 
computer, a system layer having at least one 
process group; 

D. means for taking all of said process groups 
out of service on said first computer upon a system 
layer process group fault occurring on said 
first computer; 

E. means for re-booting said first computer upon 
a system layer process group fault occurring 
on said first computer; 

F. means for providing, on at least a first of said 
at least two computers, an application layer 
having at least one process group; 

G. means for re-starting said at least one application 
layer process group on said first computer 
upon a fault in said at least one application 
layer process group; 

H. means for taking said at least one application 
layer process group out of service on said first 
computer upon failure of said re-start to cure 
said fault in said at least one application layer 
process group; 

I. means for providing at least one paired process 
group on a second of said at least two computers, 
said paired process group being paired 
with one of said at least one application layer 
process group taken out of service on said first 
computer; 

J. means for activating said paired process 
group on said second computer upon failure of 
said re-start to cure said fault in said at least 
one application layer process group taken out 
of service on said first computer; and 

K. means for re-initializing all of said process 
groups, except each of said at least one process 
group in said system layer, on said first 
computer upon not being able to take said application 
layer process group having said fault 
out of service on said first computer. 

81. The apparatus of claim 80 further comprising, in 
combination: means for re-booting said first computer 
upon failure of said re-initialization to cure said 
application layer process group fault. 

82. The apparatus of claim 81 further comprising, in 
combination: means for using an independent computer 
to power cycle said first computer upon failure 
of said re-boot of said first computer to cure said 
application layer process group fault. 

83. The apparatus of claim 82 further comprising, in 
combination: means for using an independent com- 
puter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said application layer process group fault. 

84. A method for providing high availability applications 
comprising, in combination: 

A. means for running, on at least two comput- 
ers, one or more process groups, at least one 
of said process groups containing one or more 
processes that have a fault recovery strategy 
common to said at least one of said process 
groups; 

B. means for running, on at least one of said at 
least two computers, a process group manager 
that initiates a fault recovery strategy for at least 
one of said one or more process groups; 

C. means for providing, on at least a first of said 
at least two computers, an application layer 
having at least two process groups; 

D means for defining a dependency by at least 
a first of said at least two application layer proc- 
ess groups upon at least a second of said at 
least two application layer process groups; 

E. means for re-starting said second applica- 
tion layer process group on said first computer 
upon a fault in said second application layer 
process group; 

F. means for taking said first and said second 
application layer process groups out of service 
on said first computer upon failure of said re- 
start to cure said fault in said second applica- 
tion layer process group; 

G. means for providing at least one paired proc- 
ess group on a second of said at least two com- 
puters, said paired process group being paired 
with said first application layer process group on 
said first computer; and 

H. means for activating said at least one paired 
process group on said second computer upon 
failure of said re-start to cure said fault in said 
second application layer process group. 
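
Elements D through H of claim 84 describe a dependency between two application layer process groups and a takeover by a paired process group on another computer. A minimal sketch of that sequence follows, assuming a hypothetical `ProcessGroup` record and a caller-supplied `restart` hook; none of these names come from the specification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProcessGroup:
    """Hypothetical record for one process group."""
    name: str
    computer: str
    in_service: bool = True
    depends_on: Optional["ProcessGroup"] = None   # element D: dependency
    paired_with: Optional["ProcessGroup"] = None  # element G: pair on another computer


def handle_fault(faulty: ProcessGroup,
                 groups: List[ProcessGroup],
                 restart: Callable[[ProcessGroup], bool]) -> List[str]:
    """Return the list of actions taken for a fault in `faulty`."""
    log = [f"restart {faulty.name}"]
    if restart(faulty):  # element E: the re-start cured the fault
        return log
    # Element F: take the faulty group and the groups depending on it out of service.
    affected = [faulty] + [g for g in groups if g.depends_on is faulty]
    for group in affected:
        group.in_service = False
        log.append(f"out-of-service {group.name}")
    # Element H: activate any paired process group on the other computer.
    for group in affected:
        if group.paired_with is not None:
            group.paired_with.in_service = True
            log.append(f"activate {group.paired_with.name} on {group.paired_with.computer}")
    return log


if __name__ == "__main__":
    second = ProcessGroup("second", "computer1")
    standby = ProcessGroup("first-standby", "computer2", in_service=False)
    first = ProcessGroup("first", "computer1", depends_on=second, paired_with=standby)
    for line in handle_fault(second, [first, second], restart=lambda g: False):
        print(line)
```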

85. The apparatus of claim 84 further comprising, in 
combination: 
A. means for providing, on at least said first 
computer, a system layer having at least one 
process group; 

B. means for taking all of said process groups 
out of service on said first computer upon a sys- 
tem layer process group fault occurring on said 
first computer; 

C. means for re-booting said first computer up- 
on a system layer process group fault occurring 
on said first computer; and 

D. means for re-initializing all of said process 
groups, except each of said at least one proc- 
ess group in said system layer, on said first 
computer upon not being able to take said first 
application layer process group out of service 
on said first computer. 
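
Claim 85 distinguishes recovery actions by software layer: a system layer fault affects every process group, while re-initialization spares the system layer. The sketch below shows one way that selection might be expressed. The `Layer` enumeration reflects the three layers named in the abstract; the function names and the mapping of group names to layers are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class Layer(Enum):
    SYSTEM = "system"
    PLATFORM = "platform"
    APPLICATION = "application"


def groups_to_reinitialize(groups: Dict[str, Layer]) -> List[str]:
    """Element D: every process group except those in the system layer."""
    return [name for name, layer in groups.items() if layer is not Layer.SYSTEM]


def on_system_layer_fault(groups: Dict[str, Layer]) -> List[str]:
    """Elements B and C: take every group out of service, then re-boot."""
    actions = [f"out-of-service {name}" for name in groups]
    actions.append("re-boot computer")
    return actions


if __name__ == "__main__":
    # Hypothetical process group names, for illustration only.
    example = {
        "os-services": Layer.SYSTEM,
        "database": Layer.PLATFORM,
        "billing": Layer.APPLICATION,
    }
    print(groups_to_reinitialize(example))
    print(on_system_layer_fault(example))
```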

86. The apparatus of claim 85 further comprising, in 
combination: means for re-booting said first compu- 
ter upon failure of said re-initialization to cure said 
fault in said second application layer process group. 

87. The apparatus of claim 86 further comprising, in 
combination: means for using an independent com- 
puter to power cycle said first computer upon failure 
of said re-boot of said first computer to cure said 
fault in said second application layer process group. 

88. The apparatus of claim 87 further comprising, in 
combination: means for using an independent com- 
puter to power down said first computer upon failure 
of said power cycling of said first computer to cure 
said fault in said second application layer process 
group. 